Store infra configuration in a distinct repo that does not ride trains #91

Closed
djmitche opened this issue Sep 26, 2017 · 15 comments

Comments

@djmitche
Contributor

djmitche commented Sep 26, 2017

Motivation

We have a handful of things that we often change, and that change globally, not in a ride-the-trains fashion:

  • worker-type names
  • server fingerprints (e.g., the hg server)
  • AMI IDs (like AMI sets -- and maybe we don't use this repo, depending on @walac's deployment process)
  • VPC configuration (subnetIds per region, security groups by name per region, etc.)
  • project configuration (does it do nightlies, do we listen for pushes, etc.)
  • resource URLs (tooltool, pypi, etc.)
  • worker implementations (is workerType X a docker-worker or tc-worker/docker-engine?)
  • allocations between worker types for transitions (like the buildbot-to-TC transition of the mac minis)
  • scopes accorded to each scm level [added 10/3]

I think we can store all of this in a repo for which everything -- inbound, central, beta, release, esr -- just looks at the latest commit for the current state. This would allow us to land changes that take effect immediately across projects, without having to uplift them.

Details

Repo Name: https://hg.mozilla.org/build/ci-configuration
Repo Permissions: initially scm_level_3, but we could lock this down further to releng/relops/tc. We will require review as in most repos, but will not use the merge-to-production step that the puppetagain repo uses.

File structure

No subdirectories - everything is in the top level. The temptation to make a subdirectory and have dynamically named files (like one file per project) should indicate this isn't the right place for that config.

Data is divided into files of mostly-unrelated information -- VPC config over here, project config over there -- each in a .yml file. The file should begin with a nice big comment explaining what it means and how it's interpreted.
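
The layout rules above (no subdirectories, only .yml files, each opening with an explanatory comment) could be checked mechanically. The following is a minimal sketch, not part of the proposal: the function name `check_layout` is hypothetical, and skipping dot-entries such as `.hg` is an assumption.

```python
from pathlib import Path


def check_layout(repo_root):
    """Return a list of human-readable layout problems in a checkout.

    Hypothetical helper: it encodes the three rules from the proposal and
    nothing else.
    """
    problems = []
    for entry in sorted(Path(repo_root).iterdir()):
        if entry.name.startswith("."):
            # Assumption: VCS metadata (.hg) and dotfiles are exempt.
            continue
        if entry.is_dir():
            problems.append(f"{entry.name}: subdirectories are not allowed")
        elif entry.suffix != ".yml":
            problems.append(f"{entry.name}: expected a .yml file")
        else:
            text = entry.read_text(encoding="utf-8")
            # A YAML comment starts with '#'; require one at the very top.
            if not text.lstrip().startswith("#"):
                problems.append(
                    f"{entry.name}: missing leading explanatory comment"
                )
    return problems
```

Something like this could run as a review-time check, so that a tempting `per-project/` directory is rejected before it lands.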

In-tree Usage

Tasks that need this data (mostly decision/action/cron, but maybe some other things too) will check out this repository to a predictable location. We'll define some commonly-available function to open, parse, and return a file -- something like get_ci_configuration("worker-types").
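
One possible shape for that function, as a sketch only. The `CI_CONFIGURATION_DIR` environment variable, the default path, and the `root`/`parse` parameters are assumptions made for illustration; the proposal only fixes the name `get_ci_configuration` and "a predictable location". The default parser relies on PyYAML, a third-party package.

```python
import os
from pathlib import Path

# Hypothetical: where the task checked the repository out.
DEFAULT_ROOT = os.environ.get("CI_CONFIGURATION_DIR", "ci-configuration")


def _parse_yaml(text):
    # PyYAML is third-party; importing lazily lets callers supply
    # their own parser without having it installed.
    import yaml

    return yaml.safe_load(text)


def get_ci_configuration(name, root=DEFAULT_ROOT, parse=_parse_yaml):
    """Open, parse, and return the top-level file `<name>.yml`."""
    # The repo has no subdirectories, so reject anything that is not a
    # plain file name (path separators, "..", hidden files).
    if not name or Path(name).name != name or name.startswith("."):
        raise ValueError(f"not a top-level configuration name: {name!r}")
    path = Path(root) / f"{name}.yml"
    return parse(path.read_text(encoding="utf-8"))
```

A decision task would then call `get_ci_configuration("worker-types")` and receive whatever structure that file defines.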

[edit 9/29] Other out-of-tree tools, such as hooks that might need one piece of data quickly, might instead pull a raw file from the repo (similar to what we do with .taskcluster.yml now).
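
For such tools, a raw fetch might look like the sketch below. It assumes the repo is served by hgweb with its usual `raw-file/<revision>/<path>` URL layout; the function names and the file name `projects.yml` are hypothetical.

```python
from urllib.parse import quote
from urllib.request import urlopen

REPO_URL = "https://hg.mozilla.org/build/ci-configuration"


def raw_file_url(filename, revision="default", repo_url=REPO_URL):
    """Build the hgweb raw-file URL for one top-level file."""
    return f"{repo_url}/raw-file/{quote(revision, safe='')}/{quote(filename)}"


def fetch_raw_file(filename, revision="default", timeout=30):
    """Fetch a file's text at the given revision (performs a network request)."""
    with urlopen(raw_file_url(filename, revision), timeout=timeout) as response:
        return response.read().decode("utf-8")
```

Pinning `revision` to a changeset hash instead of `default` would give a hook reproducible input, at the cost of no longer tracking the latest commit.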

Lifetime Management

[edit 9/29] Stuff in this repo will need to live until no branch still refers to it -- including ESR. So, files should live "forever", although they can be commented to indicate "not used after Firefox 92" or something like that to help with later spring cleaning.

Change Log

[new 9/29] Commits to this repository can be reflected in https://wiki.mozilla.org/ReleaseEngineering/Maintenance to indicate the timeline of infrastructure changes.

Getting Started

[new 9/29] We'll start by moving project configuration into this repo, then begin to migrate other data as time allows and needs dictate.

Not Covered

There's work afoot to move a bunch of Taskcluster configuration (scopes, roles, hooks, etc.) to be defined and managed in-tree. As automation, that stuff belongs in-tree, although it may pull some data from this repository (such as the list of projects).

@djmitche djmitche self-assigned this Sep 26, 2017
@djmitche
Contributor Author

Looking for feedback from anyone but specifically @indygreg, @jonasfj, @walac, and @gregarndt .

@indygreg

I'm about to disappear for a few weeks, so don't expect a reply...

I want to challenge the assertion that "doesn't ride the trains" precludes the use of a repo like mozilla-central. It is certainly possible to always pull the latest commit from a named repo or branch. If this repo is shared, we can limit who can make changes to specific directories in that repo as a way of ensuring untrusted people don't make changes to critical infrastructure configs.

It's also worth noting that you can obtain an archive of a sub-directory of a Mercurial repo via https://hg.mozilla.org/. See the "archive" command/url documentation at https://hg.mozilla.org/mozilla-central/help/hgweb.
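
As a rough illustration of the archive URL form described in that help page (`archive/<revision>.<type>/<path>`), here is a sketch. The helper name and the example sub-directory are hypothetical, and which archive types a given server allows depends on its configuration.

```python
from urllib.parse import quote

# Archive extensions hgweb can serve when the server enables them.
ARCHIVE_TYPES = ("zip", "tar.gz", "tar.bz2")


def archive_url(repo_url, subdir, revision="tip", archive_type="zip"):
    """Build an hgweb URL that archives only `subdir` at `revision`."""
    if archive_type not in ARCHIVE_TYPES:
        raise ValueError(f"unsupported archive type: {archive_type!r}")
    base = repo_url.rstrip("/")
    path = quote(subdir.strip("/"))
    return f"{base}/archive/{quote(revision, safe='')}.{archive_type}/{path}"
```

For example, `archive_url("https://hg.mozilla.org/mozilla-central", "taskcluster/ci")` yields a URL for a zip of just that directory at `tip`.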

In this case, having the infra configs in mozilla-central has the advantage that infra changes appear as commits in a unified timeline with Firefox. So if you see a bunch of bustage shortly after an infra commit was made, you can start to draw conclusions.

Shifting gears...

scm_level_3 has a ton of people in it, most of whom don't care about infra configs. Given the sensitivity of infra configs, we almost certainly want a smaller group to have write access. I can tell you from previous RRAs that CloudOps shares this opinion.

The use of sparse checkouts is probably over-engineering if this is a standalone repo. Sparse checkouts are (in their current form at least) viewed as a short to medium term performance hack. The ideal solution is to use a virtual filesystem that exposes every file but without the scaling issues and the problems associated with invalidated assumptions around file presence. That being said, reducing the mental complexity by only exposing a subset of files is powerful and the concept of sparse checkouts will likely exist forever. So you can lean on it if you need to.

@jonasfj

jonasfj commented Sep 27, 2017

  • Limiting write permissions to a narrower group than scm_level_3 sounds nice
  • Can we limit write access to a folder in mozilla-central? (how? is it transparent?)
    • Would we have to invent new config/infrastructure to manage such access permissions...
    • Benefit of using a different repo is that we can manage write-permissions using repo-access

having the infra configs in mozilla-central has the advantage that infra changes appear as commits in a unified timeline with Firefox

I see some benefits in this... But breakages in ESR branches (or any other branch) still won't get commits in their unified timeline.

@jonasfj

jonasfj commented Sep 27, 2017

Note: I love the idea. This is much better than using secrets to store the hg fingerprint.

@walac

walac commented Sep 27, 2017

I want to challenge the assertion that "doesn't ride the trains" precludes the use of a repo like mozilla-central. It is certainly possible to always pull the latest commit from a named repo or branch. If this repo is shared, we can limit who can make changes to specific directories in that repo as a way of ensuring untrusted people don't make changes to critical infrastructure configs.

This has limited use. Since anyone with scm level 3 can push to m-c (and possibly other branches too), they can always push new configs to a private repo and change in tree code to refer to this other repo, in which (s)he has full access.

Another security-related concern: say we change AMI IDs to be stored in this repo. Today, we don't remove old AMIs when they are replaced by new ones. Suppose we have an old, unused AMI with a tool (like sshd) that has a known vulnerability. A person may exploit this by changing the AMI ID in their private copy of the repo, changing the gecko repo to point to it, and pushing to try; we would then get a vulnerable EC2 instance running in our infra. And bear in mind that even AMIs that were not replaced for security reasons may become vulnerable in the future, as a vulnerability in a package present in the AMI can be discovered later. The obvious solution, in this case, is to remove the old AMIs when a new one is deployed. But as we all know, sometimes we might need to roll back AMIs. Then the solution is to make the AMIs rebuildable, but then we may come to the conclusion that storing AMI IDs in their own repo might not be worth the trouble.

That's one example, of course, but other configs might have their own little problems.

@jonasfj

jonasfj commented Sep 28, 2017

I don't think the goal is access control so much as keeping some config from riding the trains. Access control would then be needed because said config would be pulled in by all branches.

change in tree code to refer to this other repo, in which (s)he has full access.

Not really, do scm_level_3 have access to all branches? If not you can make said branch reference a private fork of this config repo..

@walac

walac commented Sep 28, 2017

Not really, do scm_level_3 have access to all branches? If not you can make said branch reference a private fork of this config repo..

I am not sure, but I believe so. Anyway, in my threat model, the attacker could use try (which requires scm_level_1).

@djmitche
Contributor Author

Discussion has gone on a few threads here. Let me see if I can react to each individually.

Access Control

Ultimately, this is governing what happens with source repos that are freely modifiable by anyone with level 3 access, so allowing level-3 doesn't necessarily increase access. I'd be happy with tighter controls, either now (if it's easy) or later (if it's hard).

Regarding risks from try pushes -- the data in this repo is just data, not access. So a try push can no more deploy an AMI with this data than it can without. I did indicate in the proposal that I'm contemplating moving management of hooks, workerTypes, etc. in-tree, but that would be in a fashion where in-tree automation could check those values, but the scopes to change them would still be limited to a more select group of people. That will require getting secrets out of workerTypes, among other things, and as I suggested it's not directly covered by this proposal.

Timelines

I agree that the timelines for these changes would be outside those of the Firefox source repos. We already have infra configuration changing on a timeline outside that of the source repos -- when we deploy a new AMI, update a hook, modify a role, etc. This repo would serve as a way to capture and coordinate those changes.

We could certainly keep this in the Gecko tree, perhaps with per-directory permissions applied. But assuming we decided to use mozilla-central as the repo containing the "current state", we would then need the relevant people to have the ability to land changes to that directory directly on mozilla-central even in a broken tree, and to skip performing CI on such pushes. We would have all other branches, including esr, release, beta, inbound and autoland, looking at the tip of "default" on mozilla-central. And all of those branches would carry around a config directory that may look perfectly legitimate but is unused and may actually be completely out of date.

The upshot would be that if you see bustage on esr or inbound, you would not look for an infra commit on that repo -- you'd look for one on mozilla-central. And I think that's more confusing than looking for a commit on build/ci-configuration.

Change Log

Releng has https://wiki.mozilla.org/ReleaseEngineering/Maintenance that contains a list of infra changes. It's necessarily incomplete, but automation already exists to add changes from hg repos to this timeline. It would likely be trivial to add tracking of build/ci-configuration to that list, and the result would help interested parties to correlate bustage with infra changes more accurately.

/cc @lundjordan for thoughts on that topic..

Sparse Checkouts

OK, this was maybe a little too clever. Probably a rule that every file in the repo gets left "forever" is good enough. A comment at the top saying "no longer used as of Firefox 92" would probably help later spring cleaning.

Proposal adjusted accordingly.

@escapewindow
Contributor

I agree with the proposal. I'm not 100% clear if ci-configuration is for gecko only, and we have a separate repo for standalone github projects; also, if we get a second large consumer of the taskcluster-as-a-service production cluster, are they going to have a separate ci-configuration repo, or will gecko and non-gecko configs live together? Either way, it seems like separating the taskcluster config from the gecko source tree makes sense.

@djmitche
Contributor Author

djmitche commented Oct 3, 2017

As conceived, this would be Gecko-specific. Smaller projects and potential large consumers would be left to their own devices.

It's a little awkward to call this a Taskcluster RFC, since it's not really about Taskcluster proper, but it's part of our responsibilities to the larger firefox-ci effort.

@djmitche
Contributor Author

djmitche commented Oct 9, 2017

Final comment period will last for one week. If you think there's something here on which we do not have agreement, please speak up during that time!

@djmitche
Contributor Author

This is a stretch goal for me for this quarter, so no guarantees this will move quickly -- but at least the plan is clear.

@lundjordan

Change Log

Releng has https://wiki.mozilla.org/ReleaseEngineering/Maintenance that contains a list of infra changes. It's necessarily incomplete, but automation already exists to add changes from hg repos to this timeline. It would likely be trivial to add tracking of build/ci-configuration to that list, and the result would help interested parties to correlate bustage with infra changes more accurately.

/cc @lundjordan for thoughts on that topic..

Late to the party and the final comment phase, but it's a US holiday and I'm ripping through my backlog today.

Two thoughts:

  1. I ❤️ this rfc as well. It would solve a lot of integration and release problems that releng face. Great idea!

  2. wrt the above change log idea, I think regardless of whether we define this configuration in-tree or not, it would be great to bubble up changes so releng and interested parties are notified. https://wiki.mozilla.org/ReleaseEngineering/Maintenance is as good a place as any for now. Out of scope here, but in the near future I'd like to revisit how and what changes should be tracked by releng in a post-Buildbot infra world.

@djmitche
Contributor Author

I've put the project config into the repo and adjusted tc-admin to match. I think that's about it. We can revisit merging changes into the Maintenance page (a brief glance suggests it's not easy). I emailed firefox-ci to find any stragglers still referring to production-branches.json, but I think those stragglers are few and minor. There are bugs filed to move some config into this repo already, and to move most of tc-admin in-tree. Hopefully the mere availability of this repo and code that references it will encourage its growth.

@djmitche
Contributor Author

djmitche commented Feb 7, 2018

djmitche added a commit that referenced this issue Feb 7, 2018